mgmtd: remove bogus "hedge" code which corrupted active candidate DS by choppsv1 · Pull Request #18601 · FRRouting/frr

choppsv1 · 2025-04-08T06:12:11Z

Say you have 2 mgmtd frontend sessions (2 vtysh's) the first one is long
running and is actively changing the global candidate datastore (DS),
the second one starts and exits, this code would then copy running
back over the candidate, blowing away any changes made by the first
session.

(the long running session could technically be any user)

Instead we need to trust the various cleanup code that already exits.
For example in the commit_cfg_reply on success candidate is copied to
running, and on failure for implicit commit running is copied back to
candidate clearing the change. This leaves the non-implicit
configuration changes in this case we actually want candidate to keep
it's changes in transactional cases, in the other case of pending commit
during a file read the code restores candidate (if needed) on exit from
"config terminal", with this call stack:

 vty_config_node_exit()
   nb_cli_pending_commit_check()
     nb_cli_classic_commit()
       nb_candidate_commit_prepare() [fail] -> copy running -> candidate
       nb_candidate_commit_apply() -> copy candidate -> running

fixes #18541

Say you have 2 mgmtd frontend sessions (2 vtysh's) the first one is long running and is actively changing the global candidate datastore (DS), the second one starts and exits, this code would then copy running back over the candidate, blowing away any changes made by the first session. (the long running session could technically be any user) Instead we need to trust the various cleanup code that already exits. For example in the commit_cfg_reply on success candidate is copied to running, and on failure *for implicit commit* running is copied back to candidate clearing the change. This leaves the non-implicit configuration changes in this case we actually want candidate to keep it's changes in transactional cases, in the other case of pending commit during a file read the code restores candidate (if needed) on exit from "config terminal", with this call stack: vty_config_node_exit() nb_cli_pending_commit_check() nb_cli_classic_commit() nb_candidate_commit_prepare() [fail] -> copy running -> candidate nb_candidate_commit_apply() -> copy candidate -> running fixes FRRouting#18541 Signed-off-by: Christian Hopps <[email protected]>

mjstapp · 2025-04-08T12:59:57Z

I'm ... confused about why we should support multiple simultaneous config sessions. How does that work - we don't have ... sequence numbers or uuids or a way of detecting conflicts, do we? Could multiple sessions introduce inconsistencies that would lead to invalid config - could one session do some validation, and then the second introduce a change that invalidates that?

choppsv1 · 2025-04-08T14:32:52Z

@mjstapp Our normal CLI uses "implicit-commit" mode which means each change takes affect immediately, a lock is obtained immediate prior to making the change and released immediately after. For file loads locks are obtain prior to reading the file and released after reading the file is done.

This isn't really the point though. The code being remove is blowing away the candidate config w/o obtaining any lock, and without even entering into a config state (i.e., simply doing a vtysh -c 'show foo' is enough to wipe-out another session's in progress changes).

The code is ill-considered "just in case" "cleanup" code that is blatantly wrong.

mjstapp

I think we agreed in this week's dev meeting that the __ (double-underscore) functions should be renamed; I'll push this afterwards.

Having just completed a code audit during RCA, the fact that we have 2 different argument orders for the related datastore copying functions was unnecessary and super confusing. Fix this code-maintenance/comprehension mistake and move the newer mgmtd copy routines to use the same arg order as the pre-existing underlying northbound copy functions (i.e., use `copy(dst, src)`) Signed-off-by: Christian Hopps <[email protected]>

Ported the fix FRRouting/frr#18601 from FRR community Tested using static routes, save and reboot. K>* 0.0.0.0/0 [0/0] via 10.59.128.1, eth0, 00:03:58 L * 1.1.1.1/32 is directly connected, Loopback0, 00:03:52 C>* 1.1.1.1/32 is directly connected, Loopback0, 00:03:52 C>* 10.59.128.0/20 is directly connected, eth0, 00:03:58 L> 10.59.142.159/32 is directly connected, eth0, 00:03:58 C>* 11.1.0.0/16 is directly connected, vsh0, 00:03:58 L> 11.1.22.238/32 is directly connected, vsh0, 00:03:58 C>* 20.1.1.0/24 is directly connected, Ethernet0, 00:02:22 L> 20.1.1.2/32 is directly connected, Ethernet0, 00:02:22 S>* 40.1.1.1/32 [1/0] is directly connected, Ethernet0, weight 1, 00:00:16 S>* 40.1.1.2/32 [1/0] is directly connected, Ethernet0, weight 1, 00:00:02

Ported the fix FRRouting/frr#18601 from FRR community Tested using static routes, save and reboot. K>* 0.0.0.0/0 [0/0] via 10.59.128.1, eth0, 00:03:58 L * 1.1.1.1/32 is directly connected, Loopback0, 00:03:52 C>* 1.1.1.1/32 is directly connected, Loopback0, 00:03:52 C>* 10.59.128.0/20 is directly connected, eth0, 00:03:58 L> 10.59.142.159/32 is directly connected, eth0, 00:03:58 C>* 11.1.0.0/16 is directly connected, vsh0, 00:03:58 L> 11.1.22.238/32 is directly connected, vsh0, 00:03:58 C>* 20.1.1.0/24 is directly connected, Ethernet0, 00:02:22 L> 20.1.1.2/32 is directly connected, Ethernet0, 00:02:22 S>* 40.1.1.1/32 [1/0] is directly connected, Ethernet0, weight 1, 00:00:16 S>* 40.1.1.2/32 [1/0] is directly connected, Ethernet0, weight 1, 00:00:02  - [ ] 201811 - [ ] 201911 - [ ] 202006 - [ ] 202012 - [ ] 202106 - [ ] 202111 - [ ] 202205 - [ ] 202211 - [ ] 202305 #### Tested branch (Please provide the tested image version) FRR 10.0.1

Ported the fix FRRouting/frr#18601 from FRR community Tested using static routes, save and reboot. K>* 0.0.0.0/0 [0/0] via 10.59.128.1, eth0, 00:03:58 L failure_prs.log 1.1.1.1/32 is directly connected, Loopback0, 00:03:52 C>* 1.1.1.1/32 is directly connected, Loopback0, 00:03:52 C>* 10.59.128.0/20 is directly connected, eth0, 00:03:58 L> 10.59.142.159/32 is directly connected, eth0, 00:03:58 C>* 11.1.0.0/16 is directly connected, vsh0, 00:03:58 L> 11.1.22.238/32 is directly connected, vsh0, 00:03:58 C>* 20.1.1.0/24 is directly connected, Ethernet0, 00:02:22 L> 20.1.1.2/32 is directly connected, Ethernet0, 00:02:22 S>* 40.1.1.1/32 [1/0] is directly connected, Ethernet0, weight 1, 00:00:16 S>* 40.1.1.2/32 [1/0] is directly connected, Ethernet0, weight 1, 00:00:02  - [ ] 201811 - [ ] 201911 - [ ] 202006 - [ ] 202012 - [ ] 202106 - [ ] 202111 - [ ] 202205 - [ ] 202211 - [ ] 202305 #### Tested branch (Please provide the tested image version) FRR 10.0.1

…ity (#1675) Sync PR #22408 from sonic-net/sonic-buildimage to 202412. Original PR: sonic-net/sonic-buildimage#22408 Co-authored-by: sudhanshukumar22 <[email protected]>

frrbot bot added the mgmt FRR Management Infra label Apr 8, 2025

github-actions bot added size/M master labels Apr 8, 2025

mjstapp reviewed Apr 8, 2025

View reviewed changes

choppsv1 force-pushed the chopps/mgmtd-candidate-overwrite branch from 8eb69b5 to 59d2368 Compare April 9, 2025 10:15

github-actions bot added the rebase PR needs rebase label Apr 9, 2025

mjstapp approved these changes Apr 9, 2025

View reviewed changes

mjstapp merged commit 2aa6e78 into FRRouting:master Apr 9, 2025
13 checks passed

choppsv1 deleted the chopps/mgmtd-candidate-overwrite branch April 10, 2025 05:10

sudhanshukumar22 mentioned this pull request Apr 22, 2025

Ported the fix https://github.com/FRRouting/frr/pull/18601 from FRR community sonic-net/sonic-buildimage#22408

Merged

9 tasks

sudhanshukumar22 added a commit to sudhanshukumar22/sonic-buildimage that referenced this pull request Jun 3, 2025

Ported the fix FRRouting/frr#18601 from FRR community

a4e5914

mssonicbld mentioned this pull request Jun 18, 2025

[action] [PR:22408] Ported the fix https://github.com/FRRouting/frr/pull/18601 from FRR community sonic-net/sonic-buildimage#23021

Merged

9 tasks

r12f pushed a commit to r12f/sonic-buildimage-msft that referenced this pull request Sep 27, 2025

Cherry pick Ported the fix FRRouting/frr#18601 from FRR community

81c39b7

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

mgmtd: remove bogus "hedge" code which corrupted active candidate DS#18601

mgmtd: remove bogus "hedge" code which corrupted active candidate DS#18601
mjstapp merged 2 commits intoFRRouting:masterfrom
LabNConsulting:chopps/mgmtd-candidate-overwrite

choppsv1 commented Apr 8, 2025 •

edited

Loading

Uh oh!

mjstapp commented Apr 8, 2025

Uh oh!

choppsv1 commented Apr 8, 2025 •

edited

Loading

Uh oh!

mjstapp left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

choppsv1 commented Apr 8, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mjstapp commented Apr 8, 2025

Uh oh!

choppsv1 commented Apr 8, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mjstapp left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

choppsv1 commented Apr 8, 2025 •

edited

Loading

choppsv1 commented Apr 8, 2025 •

edited

Loading